Abstract


Recently, fully-connected and convolutional neural networks
have been trained to achieve state-of-the-art performance on
a wide variety of tasks such as speech recognition, image
classification, natural language processing, and
bioinformatics. For classification tasks, most of these
“deep learning” models employ the softmax activation
function for prediction and minimize cross-entropy loss. In
this paper, we demonstrate a small but consistent advantage
of replacing the softmax layer with a linear support vector
machine. Learning minimizes a margin-based loss instead of
the cross-entropy loss. While there have been various
combinations of neural nets and SVMs in prior art, our
results using L2-SVMs show that simply replacing softmax
with linear SVMs gives significant gains on the popular deep
learning datasets MNIST and CIFAR-10, and on the ICML 2013
Representation Learning Workshop’s face expression
recognition challenge.


1. Introduction 


Deep learning using neural networks has claimed
state-of-the-art performance in a wide range of tasks.
These include (but are not limited to) speech (Mohamed ...).
All of the above-mentioned papers use the softmax
activation function (also known as multinomial logistic
regression) for classification.







The support vector machine is a widely used alternative to
softmax for classification (1992). Using SVMs (especially
linear) in combination with convolutional nets has been
proposed in the past as part of a multistage process. In
particular, a deep convolutional net is first trained using
supervised/unsupervised objectives to learn good invariant
hidden latent representations. The corresponding hidden
variables of data samples are then treated as input and fed
into linear (or kernel) SVMs (Huang & LeCun, 2006; Lee et
al., 2009; 2010; Coates et al., 2011). This technique
usually improves performance, but the drawback is that lower
level features are not fine-tuned w.r.t. the SVM’s
objective.


Other papers have also proposed similar models but with
joint training of weights at lower layers using both
standard neural nets as well as convolutional neural nets
(Zhong & Ghosh, 2000; Collobert & Bengio; Nagi et al.,
2012). In other related works, Weston et al. (2008) proposed
a semi-supervised embedding algorithm for deep learning
where the hinge loss is combined with the “contrastive loss”
from siamese networks (Hadsell et al., 2006). Lower layer
weights are learned using stochastic gradient descent.
(2012) learns a recursive representation using linear SVMs
at every layer, but without joint fine-tuning of the hidden
representation.












In this paper, we show that for some deep architectures, a
linear SVM top layer instead of a softmax is beneficial. We
optimize the primal problem of the SVM, and the gradients
can be backpropagated to learn lower level features. Our
models are essentially the same as the ones proposed in
(Zhong & Ghosh, 2000; Nagi et al., 2012), with the minor
novelty of using the loss from the L2-SVM instead of the
standard hinge loss.


Unlike the hinge loss of a standard SVM, the loss for the
L2-SVM is differentiable and penalizes errors much more
heavily. The primal L2-SVM objective was proposed 3 years
before the invention of SVMs (1989)! A similar objective and
its optimization are also discussed by Lee & Mangasarian
(2001).


Compared to nets using a top layer softmax, we demonstrate
superior performance on MNIST, CIFAR-10, and on a recent
Kaggle competition on recognizing face expressions.
Optimization is done using stochastic gradient descent on
small minibatches. Comparing the two models in Sec. 3.4, we
believe the performance gain is largely due to the superior
regularization effects of the SVM loss function, rather than
an advantage from better parameter optimization.


2. The model 
2.1. Softmax 


For classification problems using deep learning techniques,
it is standard to use the softmax or 1-of-K encoding at the
top. For example, given 10 possible classes, the softmax
layer has 10 nodes denoted by $p_i$, where $i = 1, \ldots,
10$. $p_i$ specifies a discrete probability distribution,
therefore $\sum_{i=1}^{10} p_i = 1$.


Let $h$ be the activation of the penultimate layer nodes and
$W$ be the weights connecting the penultimate layer to the
softmax layer. The total input into the softmax layer, given
by $a$, is

$$a_i = \sum_k h_k W_{ki} \tag{1}$$

then we have

$$p_i = \frac{\exp(a_i)}{\sum_{j=1}^{10} \exp(a_j)} \tag{2}$$

The predicted class $\hat{i}$ would be

$$\hat{i} = \arg\max_i \, p_i = \arg\max_i \, a_i \tag{3}$$
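The softmax layer described here can be sketched in a few lines of NumPy. This is only an illustration, not the authors' implementation: the layer sizes, the random inputs, and the function name `softmax` are assumptions made for the example. It shows that the class with the largest probability is also the class with the largest total input, as the last equation states.

```python
import numpy as np

def softmax(a):
    """Softmax probabilities for a vector of total inputs a."""
    shifted = a - np.max(a)      # subtracting a constant does not change p
    e = np.exp(shifted)
    return e / e.sum()

rng = np.random.default_rng(0)
h = rng.standard_normal(5)        # penultimate activations (size is hypothetical)
W = rng.standard_normal((5, 10))  # weights into the 10 softmax nodes
a = h @ W                         # total input: a_i = sum_k h_k W_ki
p = softmax(a)

# The predicted class is the same whether we rank by p or by a.
predicted_from_p = int(np.argmax(p))
predicted_from_a = int(np.argmax(a))
```

Because the exponential is monotonic and the normalizer is shared by all classes, prediction never needs the probabilities themselves.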


2.2. Support Vector Machines 


Linear support vector machines (SVMs) were originally
formulated for binary classification. Given training data
and its corresponding labels $(x_n, t_n)$, $n = 1, \ldots,
N$, $x_n \in \mathbb{R}^D$, $t_n \in \{-1, +1\}$, SVM
learning consists of the following constrained optimization:

$$\min_{w,\,\xi_n} \; \frac{1}{2} w^T w + C \sum_{n=1}^{N} \xi_n \tag{4}$$
$$\text{s.t.} \quad w^T x_n t_n \ge 1 - \xi_n \;\; \forall n, \qquad \xi_n \ge 0 \;\; \forall n$$

$\xi_n$ are slack variables which penalize data points that
violate the margin requirements. Note that we can include
the bias by augmenting all data vectors $x_n$ with a scalar
value of 1. The corresponding unconstrained optimization
problem is the following:

$$\min_{w} \; \frac{1}{2} w^T w + C \sum_{n=1}^{N} \max(1 - w^T x_n t_n, 0) \tag{5}$$


The objective of Eq. 5 is known as the primal form problem
of the L1-SVM, with the standard hinge loss. Since the
L1-SVM is not differentiable, a popular variation is the
L2-SVM, which minimizes the squared hinge loss:

$$\min_{w} \; \frac{1}{2} w^T w + C \sum_{n=1}^{N} \max(1 - w^T x_n t_n, 0)^2 \tag{6}$$

The L2-SVM is differentiable and imposes a bigger (quadratic
vs. linear) loss for points which violate the margin. To
predict the class label of a test data point $x$:

$$\arg\max_t \; (w^T x)\, t \tag{7}$$
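To make the difference between the two primal objectives concrete, the following sketch evaluates the hinge (L1-SVM) and squared hinge (L2-SVM) objectives on a tiny hand-made data set. The function names and the toy data are assumptions introduced for illustration only.

```python
import numpy as np

def l1_svm_objective(w, X, t, C):
    """Primal L1-SVM: 0.5 * w'w + C * sum of hinge losses."""
    violations = np.maximum(1.0 - (X @ w) * t, 0.0)
    return 0.5 * (w @ w) + C * violations.sum()

def l2_svm_objective(w, X, t, C):
    """Primal L2-SVM: 0.5 * w'w + C * sum of squared hinge losses."""
    violations = np.maximum(1.0 - (X @ w) * t, 0.0)
    return 0.5 * (w @ w) + C * (violations ** 2).sum()

# Three points, all labeled +1 (toy data, not from the paper).
X = np.array([[2.0, 0.0],    # outside the margin: no loss
              [0.5, 0.0],    # inside the margin: violation 0.5
              [-3.0, 0.0]])  # misclassified: violation 4.0
t = np.array([1.0, 1.0, 1.0])
w = np.array([1.0, 0.0])

l1_value = l1_svm_objective(w, X, t, C=1.0)   # 0.5 + (0.5 + 4.0)
l2_value = l2_svm_objective(w, X, t, C=1.0)   # 0.5 + (0.25 + 16.0)
```

The small violation contributes less under the squared loss (0.25 vs. 0.5) while the large one contributes far more (16 vs. 4), which is the "bigger (quadratic vs. linear)" penalty mentioned in the text.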
t 


For kernel SVMs, optimization must be performed in the dual.
However, scalability is a problem with kernel SVMs, and in
this paper we will only use linear SVMs with standard deep
learning models.


2.3. Multiclass SVMs 


The simplest way to extend SVMs to multiclass problems is
the so-called one-vs-rest approach (Vapnik, 1995). For
$K$-class problems, $K$ linear SVMs are trained
independently, where the data from the other classes form
the negative cases. (2002) discusses other alternative
multiclass SVM approaches, but we leave those to future
work.

Denoting the output of the $k$-th SVM as

$$a_k(x) = w_k^T x \tag{8}$$

the predicted class is

$$\arg\max_k \; a_k(x) \tag{9}$$
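A minimal sketch of the one-vs-rest setup follows. It shows how integer class labels can be turned into the +1/-1 targets that each of the K binary SVMs sees, and how prediction takes the arg max over the K outputs. The helper names `one_vs_rest_targets` and `predict`, and the toy weights, are assumptions for this example.

```python
import numpy as np

def one_vs_rest_targets(labels, K):
    """Return an (N, K) matrix with +1 for the true class and -1 elsewhere."""
    T = -np.ones((labels.shape[0], K))
    T[np.arange(labels.shape[0]), labels] = 1.0
    return T

def predict(W, X):
    """Column k of W holds the weights of the k-th SVM; pick the largest output."""
    return np.argmax(X @ W, axis=1)

labels = np.array([0, 2, 1])
T = one_vs_rest_targets(labels, K=3)

W = np.eye(3)                      # toy weights: each SVM reads one feature
X = np.array([[3.0, 1.0, 0.0],
              [0.0, 1.0, 2.0],
              [0.0, 5.0, 1.0]])
pred = predict(W, X)
```

Each column of `T` is the label vector for one binary problem, so the K SVMs can share the same inputs and be trained side by side.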
k 


Note that prediction using SVMs is exactly the same as
using a softmax (Eq. 3). The only difference between
softmax and multiclass SVMs is in their objectives,
parametrized by all of the weight matrices $W$. The softmax
layer minimizes cross-entropy or maximizes the
log-likelihood, while SVMs simply try to find the maximum
margin between data points of different classes.


2.4. Deep Learning with Support Vector 
Machines 


Most deep learning methods for classification using fully
connected layers and convolutional layers have used the
softmax layer objective to learn the lower level
parameters. There are exceptions, notably the papers by
(Zhong & Ghosh, 2000; Nagi et al., 2012), supervised
embedding with nonlinear NCA (Salakhutdinov & Hinton, 2007),
and semi-supervised deep embedding (Weston et al., 2008). In
this paper, we use the L2-SVM’s objective to train deep
neural nets for classification. Lower layer weights are
learned by backpropagating the gradients from the top layer
linear SVM. To do this, we need to differentiate the SVM
objective with respect to the activation of the penultimate
layer. Let the objective in Eq. 5 be $l(w)$, and let the
input $x$ be replaced with the penultimate activation $h$.
Since the term $\frac{1}{2} w^T w$ does not depend on $h$,

$$\frac{\partial l(w)}{\partial h_n} = -C t_n w \left( \mathbb{I}\{1 > w^T h_n t_n\} \right) \tag{10}$$

where $\mathbb{I}\{\cdot\}$ is the indicator function.
Likewise, for the L2-SVM, we have

$$\frac{\partial l(w)}{\partial h_n} = -2 C t_n w \left( \max(1 - w^T h_n t_n, 0) \right) \tag{11}$$


From this point on, the backpropagation algorithm is exactly
the same as in standard softmax-based deep learning
networks. We found the L2-SVM to be slightly better than the
L1-SVM most of the time and will use the L2-SVM in the
experiments section.
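The gradients with respect to the penultimate activation can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' code. It handles a single binary SVM and a single example, takes the weight-decay term to be independent of h so that only the data term contributes, and uses made-up values for w, h, t and C. It compares the analytic L2-SVM gradient with a central finite difference.

```python
import numpy as np

def hinge_grad_h(w, h, t, C):
    """Gradient of the L1-SVM (hinge) objective w.r.t. h for one example."""
    active = 1.0 if 1.0 > (w @ h) * t else 0.0   # indicator of a margin violation
    return -C * t * w * active

def sq_hinge_grad_h(w, h, t, C):
    """Gradient of the L2-SVM (squared hinge) objective w.r.t. h for one example."""
    return -2.0 * C * t * w * max(1.0 - (w @ h) * t, 0.0)

def sq_hinge_objective(w, h, t, C):
    return 0.5 * (w @ w) + C * max(1.0 - (w @ h) * t, 0.0) ** 2

def numeric_grad(f, h, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    g = np.zeros_like(h)
    for i in range(h.size):
        d = np.zeros_like(h)
        d[i] = eps
        g[i] = (f(h + d) - f(h - d)) / (2.0 * eps)
    return g

w = np.array([0.5, -1.0, 0.25])   # hypothetical top-layer weights
h = np.array([0.2, 0.1, -0.4])    # hypothetical penultimate activation
t, C = 1.0, 2.0

analytic = sq_hinge_grad_h(w, h, t, C)
numeric = numeric_grad(lambda v: sq_hinge_objective(w, v, t, C), h)
l1_grad = hinge_grad_h(w, h, t, C)
```

In a multiclass network the same expression would be summed over the K one-vs-rest SVMs before being passed down to the lower layers.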


3. Experiments 
3.1. Facial Expression Recognition 


This competition/challenge was hosted by the ICML 2013
workshop on representation learning, organized by LISA at
the University of Montreal. The contest itself was hosted on
Kaggle with over 120 competing teams during the initial
developmental period.


The data consist of 28,709 48x48 images of faces under 7
different types of expression. See Fig. 1 for examples and
their corresponding expression category. The validation and
test sets each consist of 3,589 images, and this is a
classification task.


WINNING SOLUTION 


We submitted the winning solution with a public validation
score of 69.4% and a corresponding private test score of
71.2%. Our private test score is almost 2% higher than that
of the 2nd place team. Due to label noise and other factors
such as corrupted data, human performance is roughly
estimated to be between 65% and 68%.¹

¹ Personal communication from the competition organizers:
http://bit.ly/13Zr6Gs

Figure 1. Training data. Each column consists of faces of
the same expression. Starting from the leftmost column:
Angry, Disgust, Fear, Happy, Sad, Surprise, Neutral.

Our submission consists of a simple Convolutional Neural
Network with a linear one-vs-all SVM at the top. Stochastic
gradient descent with momentum is used for training, and
several models are averaged to slightly improve
generalization. Data preprocessing consisted of first
subtracting the mean value of each image and then setting
the image norm to 100. Each pixel is then standardized by
removing its mean and dividing its value by the standard
deviation of that pixel across all training images.
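The preprocessing steps just described can be sketched as follows. This is a hedged reconstruction rather than the released code. It assumes that "image norm" means the Euclidean norm of the flattened, mean-subtracted image, adds a small `eps` to avoid division by zero, and uses random arrays in place of the real face images.

```python
import numpy as np

def preprocess(images, target_norm=100.0, eps=1e-8):
    """Per-image centering and norm scaling, then per-pixel standardization."""
    X = images.reshape(images.shape[0], -1).astype(np.float64)
    X = X - X.mean(axis=1, keepdims=True)              # subtract each image's mean
    norms = np.linalg.norm(X, axis=1, keepdims=True)   # assumed: Euclidean norm
    X = X * (target_norm / (norms + eps))              # set image norm to 100
    pixel_mean = X.mean(axis=0)                        # statistics over training images
    pixel_std = X.std(axis=0)
    return (X - pixel_mean) / (pixel_std + eps), pixel_mean, pixel_std

rng = np.random.default_rng(0)
fake_faces = rng.uniform(0.0, 255.0, size=(64, 48, 48))  # stand-in for real data
Xp, pixel_mean, pixel_std = preprocess(fake_faces)
```

The returned `pixel_mean` and `pixel_std` would be reused, unchanged, when standardizing validation and test images.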


Our implementation is in C++ and CUDA, with ports to Matlab
using MEX files. Our convolution routines used fast CUDA
kernels written by Alex Krizhevsky.² The exact model
parameters and code are provided by the author at
https://code.google.com/p/deep-learning-faces.


3.1.1. Softmax vs. DLSVM


We compared the performance of softmax with deep learning
using L2-SVMs (DLSVM). Both models are tested using 8-fold
cross validation. Each has an image mirroring layer, a
similarity transformation layer, and two convolutional
filtering+pooling stages, followed by a fully connected
layer with 3072 penultimate hidden units. The hidden layers
are all of the rectified linear type. Other hyperparameters
such as weight decay are selected using cross validation.


We can also look at the validation curve of softmax vs.
L2-SVMs as a function of weight updates in Fig. 2. As the
learning rate is lowered during the latter half of training,
DLSVM maintains a small yet clear performance gain.


We also plotted the 1st layer convolutional filters of the
two models (Figs. 3 and 4).


While not much can be gained from looking at these filters,
the SVM-trained conv net appears to have more textured
filters.

² http://code.google.com/p/cuda-convnet

                          | Softmax | DLSVM L2
Training cross validation | 67.6%   | 68.9%
Public leaderboard        | 69.3%   | 69.4%
Private leaderboard       | 70.1%   | 71.2%

Table 1. Comparisons of the models in terms of % accuracy.
Training c.v. is the average cross validation accuracy over
8 splits. Public leaderboard is the held-out validation set
scored via Kaggle’s public leaderboard. Private leaderboard
is the final private leaderboard score used to determine the
competition’s winners.

Figure 2. Cross validation performance of the two models.
Result is averaged over 8 folds.


3.2. MNIST 


MNIST is a standard handwritten digit classification 
dataset and has been widely used as a benchmark 
dataset in deep learning. It is a 10 class classification 
problem with 60,000 training examples and 10,000 test 
cases. 


We used a simple fully connected model, first performing PCA
from 784 dimensions down to 70 dimensions. Two hidden layers
of 512 units each are followed by a softmax or an L2-SVM.
The data is then divided up into 300 minibatches of 200
samples each. We trained using stochastic gradient descent
with momentum on these 300 minibatches for over 400 epochs,
totaling 120K weight updates. The learning rate is linearly
decayed from 0.1 to 0.0. The L2 weight cost on the softmax
layer is set to 0.001. To prevent overfitting, and critical
to achieving good results, a large amount of Gaussian noise
is added to the input: the noise has a standard deviation of
1.0, linearly decayed to 0. The idea of adding Gaussian
noise is taken from (Raiko et al., 2012; Rifai et al.,
2011b).
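The two linear schedules mentioned above (learning rate from 0.1 to 0, input noise standard deviation from 1.0 to 0) can be sketched as below. The text does not say whether the decay is applied per weight update or per epoch; this example assumes per update, and the names `linear_decay` and `noisy_batch` are hypothetical.

```python
import numpy as np

def linear_decay(start, end, step, total_steps):
    """Linearly interpolate from start (step 0) to end (last step)."""
    frac = step / (total_steps - 1)
    return start + (end - start) * frac

minibatches_per_epoch = 300
epochs = 400
total_updates = minibatches_per_epoch * epochs   # 120K weight updates

def noisy_batch(x, step, rng):
    """Add Gaussian noise whose std decays from 1.0 to 0 over training."""
    sigma = linear_decay(1.0, 0.0, step, total_updates)
    return x + sigma * rng.standard_normal(x.shape)

rng = np.random.default_rng(0)
batch = rng.standard_normal((200, 70))           # 200 samples, 70 PCA dimensions

lr_first = linear_decay(0.1, 0.0, 0, total_updates)
lr_last = linear_decay(0.1, 0.0, total_updates - 1, total_updates)
last_batch = noisy_batch(batch, total_updates - 1, rng)
```

At the final update both the learning rate and the noise level reach zero, so the last batch is passed through unchanged.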
Figure 3. Filters from convolutional net with softmax.
Figure 4. Filters from convolutional net with L2-SVM.










Our learning algorithm is permutation invariant without any
unsupervised pretraining and obtains these error rates:
Softmax: 0.99%, DLSVM: 0.87%.


An error of 0.87% on MNIST is probably (at this time)
state-of-the-art for the above learning setting. The only
difference between softmax and DLSVM is the last layer. This
experiment is mainly to demonstrate the effectiveness of the
last linear SVM layer vs. the softmax; we have not
exhaustively explored other commonly used tricks such as
Dropout, weight constraints, hidden unit sparsity, adding
more hidden layers, and increasing the layer size.


3.3. CIFAR-10 


The Canadian Institute For Advanced Research 10 (CIFAR-10)
dataset is a 10 class object dataset with 50,000 images for
training and 10,000 for testing. The colored images are
32 x 32 in resolution. We trained a Convolutional Neural Net
with two alternating pooling and filtering layers.
Horizontal reflection and jitter are applied to the data
randomly before the weights are updated using a minibatch of
128 data cases.


The Convolutional Net part of both models is fairly
standard: the first C layer has 32 5 x 5 filters with ReLU
hidden units, and the second C layer has 64 5 x 5 filters.
Both pooling layers use max pooling and downsample by a
factor of 2.


The penultimate layer has 3072 hidden nodes and uses ReLU
activation with a dropout rate of 0.2. The difference
between the ConvNet+Softmax and the ConvNet with L2-SVM is
mainly in the SVM’s C constant, the softmax’s weight decay
constant, and the learning rate. We selected the values of
these hyperparameters for each model separately using
validation.





           | ConvNet+Softmax | ConvNet+SVM
Test error | 14.0%           | 11.9%

Table 2. Comparisons of the models in terms of % error on
the test set.


In the literature, the state-of-the-art result (at the time
of writing) is around 9.5% by Snoek et al. (2012). However,
that model is different, as it includes contrast
normalization layers and uses Bayesian optimization to tune
its hyperparameters.


3.4. Regularization or Optimization 


To see whether the gain in DLSVM is due to the superiority
of the objective function or to the ability to better
optimize, we looked at the two final models’ loss under
their own objective function as well as under the other
objective. The results are in Table 3.




















                   | ConvNet+Softmax | ConvNet+SVM
Test error         | 14.0%           | 11.9%
Avg. cross entropy | 0.072           | 0.353
Hinge loss squared | 213.2           | 0.313

Table 3. Training objective including the weight costs.


It is interesting to note here that lower cross entropy
actually led to a higher error in the middle row. In
addition, we also initialized a ConvNet+Softmax model with
the weights of the DLSVM that had 11.9% error. As further
training is performed, the network’s error rate gradually
increased towards 14%.


This gives limited evidence that the gain of DLSVM is
largely due to a better objective function.
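The comparison in this section amounts to scoring the same top-layer outputs under both losses. The sketch below shows one way to do that. It is an assumption-laden illustration: it omits the weight costs and the C constant, averages over examples, uses one-vs-rest targets for the squared hinge loss, and runs on made-up scores rather than the paper's models.

```python
import numpy as np

def avg_cross_entropy(scores, labels):
    """Mean negative log-probability of the true class under a softmax."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_p = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_p[np.arange(labels.size), labels].mean()

def avg_squared_hinge(scores, labels):
    """Mean one-vs-rest squared hinge loss (data term only)."""
    T = -np.ones_like(scores)
    T[np.arange(labels.size), labels] = 1.0
    return (np.maximum(1.0 - scores * T, 0.0) ** 2).sum(axis=1).mean()

# Two examples, two classes (toy top-layer outputs).
scores = np.array([[2.0, -2.0],   # confident and correct for class 0
                   [0.0, 0.0]])   # undecided, true class is 1
labels = np.array([0, 1])

ce = avg_cross_entropy(scores, labels)
sq_hinge = avg_squared_hinge(scores, labels)
```

A model trained on one loss can score well on it while scoring poorly on the other, which is the pattern visible in Table 3.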


4. Conclusions 


In conclusion, we have shown that DLSVM works better than
softmax on 2 standard datasets and a recent dataset.
Switching from softmax to SVMs is incredibly simple and
appears to be useful for classification tasks. Further
research is needed to explore other multiclass SVM
formulations and to better understand where the gain comes
from and how large it is.